Do Elon Musk’s Tweets Have an Influence on Tesla’s Stock Price?


Not long after sending out this tweet on 7 August 2018, Elon Musk faced a lawsuit from the U.S. Securities and Exchange Commission (SEC) for making “false and misleading statements” and for market manipulation. With the tweet, Musk implied that he had secured funding to take Tesla private at the stated price, which was substantially above its trading price at the time, although no real arrangements had been made and no offers had been received. According to the SEC, his tweet “set off a trading frenzy” and pushed Tesla’s stock price up by more than 6 percent, forcing the NASDAQ exchange to halt Tesla trading for 90 minutes until the company gave an official response. The company’s stock price closed at $379.57 on the day of the tweet. Less than two months later, Musk agreed to a settlement which required him and Tesla to pay a fine of $20 million each and, in addition, required him to step down as chairman of Tesla’s board. 1


Elon Musk by Dr. Seuss (Poem)

The SEC said:
“Musk, your tweets are a blight.
They really could cost you your job,
if you don’t stop
all this tweeting at night.”

…Then Musk cried:
“Why? The tweets I wrote are not mean,
I don’t use all-caps and I’m sure that my tweets are clean.”

“But your tweets can move markets
and that’s why we’re sore.
You may be a genius and a billionaire,
but that doesn’t give you the right to be a bore!”

— OpenAI, co-founded by Musk, AI-generated poem


What’s our Goal?

The idea of this R notebook is to introduce everyone interested in data science and machine learning to the effective communication of data analyses and statistical findings with suitable visualisations. Along the way, we also analyse the evolution of Tesla’s stock price and the influence Elon Musk’s tweets had on it. To visualise analyses and findings, we use the ggplot2 and plotly packages (as well as some additional packages), since they produce high-quality, publication-ready visualisations for static as well as dynamic and interactive applications. Both packages are built around the so-called Grammar of Graphics, a systematic framework for data visualisation which describes how the elements or components of a plot are separated and classified. For more information, see Hadley Wickham (2010) - A Layered Grammar of Graphics and Wilkinson (2011) - The Grammar of Graphics.

I can also highly recommend the following resources:

# TODO: Add time of stock split to time series, search for short seller tweets in data, add a log scale plot
# to Tesla's stock price chart, add most recent stock return on distribution, think about colour choice (restrict it)

1 Settings

# Turn off warning messages

options(warn = -1)

# Custom function for checking installation of packages and loading them

install_and_load_package <- function(package) {

    # Check whether package is already installed and if not, install it

    if (!require(package, character.only = TRUE)) {

        install.packages(package, dependencies = TRUE)

    }

    # Load specified package

    require(package, character.only = TRUE)

}

# Specify packages needed for analysis in character vector

packages <- c("conflicted",
              "foreach",
              "doMC",
              "gapminder",
              "httr",
              "rtweet",
              "quantmod",
              "pins",
              "tidyverse",
              "lubridate",
              "tsbox",
              "tidytext",
              "qdap",
              "lmtest",
              "sandwich",
              "caret",
              "DT",
              "ggrepel",
              "plotly",
              "wordcloud2",
              "fmsb",
              "rayrender",
              "rayshader",
              "viridis",
              "viridisLite",
              "RColorBrewer")

# Install and load needed packages

lapply(packages, install_and_load_package)

# Conflicted: hierarchy in case of conflict

conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("first", "dplyr")
conflict_prefer("last", "dplyr")
conflict_prefer("lag", "dplyr")
conflict_prefer("flatten", "purrr")
conflict_prefer("layout", "plotly")

# Parallel computing settings: Using maximum number of available cores

n_CPU_cores <- parallel::detectCores()

registerDoMC(cores = n_CPU_cores)

# Color settings

palette(viridis(n = 10))

col_palette_red    <- brewer.pal(n = 9, name = "OrRd")

col_palette_yellow <- brewer.pal(n = 9, name = "YlOrRd")

col_palette_green  <- brewer.pal(n = 9, name = "YlGn")

col_palette_blue   <- brewer.pal(n = 9, name = "PuBu")

2 Data Input

# Some options for quantmod package

options("getSymbols.warning4.0" = F)

To start with, we get Tesla stock data (ticker = “TSLA”) from Yahoo Finance by using the quantmod package. All that is required to download the data is the ticker of the corresponding financial instrument.

getSymbols(Symbols = "TSLA",
           src     = "yahoo",
           verbose = F)

Second, we also get S&P 500 index (SPY ETF) data (ticker = “SPY”) from Yahoo Finance.

getSymbols(Symbols = "SPY",
           src     = "yahoo",
           verbose = F)

And finally, we download NASDAQ index data (ticker = “^IXIC”) from the same source.

getSymbols(Symbols = "^IXIC",
           src     = "yahoo",
           verbose = F)

Next, we do some data wrangling to transform the Tesla stock data into a tibble with the dplyr and tsbox packages and rename its columns. Tibbles are enhanced data.frames around which the tidyverse packages (and a great many other packages) are built. They provide a standardised way of storing data coming in diverse formats. I also use the pipe operator %>% to make the workflow and the required steps easy to grasp and adjust later on (see picture below for a short explanation).

df_Tesla_stock_data <- TSLA %>%
    ts_tbl() %>%
    ts_wide() %>%
    rename(Date     = time,
           Open     = TSLA.Open,
           High     = TSLA.High,
           Low      = TSLA.Low,
           Close    = TSLA.Close,
           Volume   = TSLA.Volume,
           Adjusted = TSLA.Adjusted)
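
As a minimal sketch of what the pipe does (toy numbers rather than Tesla data, and assuming only the magrittr package, whose %>% operator the tidyverse packages re-export): x %>% f(y) is evaluated as f(x, y), so a nested call and a piped chain return the same result.

```r
library(magrittr)  # provides %>%; dplyr and friends re-export it

prices <- c(100, 102, 101, 105)  # hypothetical toy prices

# Nested call: has to be read from the inside out
nested <- round(mean(diff(log(prices))), digits = 4)

# Piped chain: reads top to bottom, one step per line
piped <- prices %>%
    log() %>%
    diff() %>%
    mean() %>%
    round(digits = 4)

identical(nested, piped)
```

The piped version is longer, but each step can be commented out or swapped without having to rebalance parentheses.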

The Tesla stock data now looks like this, with one row for each trading day and seven different variables, also called features in the machine learning (ML) context, in the columns. For each of the 2’615 daily observations, we have the corresponding date in the Date column, the Open price at the start of trading on the exchange, the daily High and Low prices, the Close at the end of trading, the trading Volume, and finally an Adjusted price, accounting for stock splits, dividends, and similar corporate actions.

df_Tesla_stock_data %>%
    mutate(across(where(is.numeric), ~ round(., digits = 3))) %>%
    datatable()

We do the same for the S&P 500 (SPY ETF) index data as well as the NASDAQ index.

df_SPY_data <- SPY %>%
    ts_tbl() %>%
    ts_wide() %>%
    rename(Date     = time,
           Open     = SPY.Open,
           High     = SPY.High,
           Low      = SPY.Low,
           Close    = SPY.Close,
           Volume   = SPY.Volume,
           Adjusted = SPY.Adjusted)

df_NASDAQ_data <- IXIC %>%
    ts_tbl() %>%
    ts_wide() %>%
    rename(Date           = time,
           OpenNASDAQ     = IXIC.Open,
           HighNASDAQ     = IXIC.High,
           LowNASDAQ      = IXIC.Low,
           CloseNASDAQ    = IXIC.Close,
           VolumeNASDAQ   = IXIC.Volume,
           AdjustedNASDAQ = IXIC.Adjusted)

The S&P 500 (SPY ETF) and NASDAQ index series have more observations than the Tesla series, i.e. both have data points on 3’493 days. Otherwise, they are in the same format. Here is what the S&P 500 time series looks like:

df_SPY_data %>%
    mutate(across(where(is.numeric), ~ round(., digits = 3))) %>%
    datatable()

Finally, we add all three stock price and index time series together to have them available in a single tibble.

df_Tesla_SPY_NASDAQ <- df_SPY_data %>%
    full_join(df_Tesla_stock_data,
              by     = "Date",
              suffix = c("SPY", "TSLA")) %>%
    full_join(df_NASDAQ_data,
              by     = "Date")
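
To see what the suffix argument and the full join do, here is a small sketch with two made-up series (the values and the df_a / df_b names are purely illustrative). Columns that share a name in both tibbles get the respective suffix appended, and dates that exist in only one series are kept and filled with NA in the other.

```r
library(dplyr)

# Hypothetical toy data: the two series do not cover exactly the same dates
df_a <- tibble(Date     = as.Date(c("2020-01-02", "2020-01-03", "2020-01-06")),
               Adjusted = c(300, 302, 301))

df_b <- tibble(Date     = as.Date(c("2020-01-03", "2020-01-06")),
               Adjusted = c(88, 90))

df_joined <- full_join(df_a, df_b,
                       by     = "Date",
                       suffix = c("SPY", "TSLA"))

# Column names are now Date, AdjustedSPY, and AdjustedTSLA
names(df_joined)

# The first date only exists in df_a, so AdjustedTSLA is NA there
df_joined
```

This is why the NASDAQ columns were renamed by hand beforehand: in the second join no column names clash any more, so no suffix would be added.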

2.1 Exercise 1: Get Apple stock data (hint: ticker = “AAPL”) and turn it into a tibble.

# getSymbols(Symbols = "AAPL",
#            src     = "yahoo",
#            verbose = F)
#
# df_Apple_data <- AAPL %>%
#     ts_tbl() %>%
#     ts_wide() %>%
#     rename(Date     = time,
#            Open     = AAPL.Open,
#            High     = AAPL.High,
#            Low      = AAPL.Low,
#            Close    = AAPL.Close,
#            Volume   = AAPL.Volume,
#            Adjusted = AAPL.Adjusted)

We now compute the continuously compounded (log) stock returns for all three financial instruments.

df_Tesla_SPY_NASDAQ <- df_Tesla_SPY_NASDAQ %>%
    mutate(ReturnsSPY    = log(AdjustedSPY) - lag(log(AdjustedSPY)),
           ReturnsTSLA   = log(AdjustedTSLA) - lag(log(AdjustedTSLA)),
           ReturnsNASDAQ = log(AdjustedNASDAQ) - lag(log(AdjustedNASDAQ)))
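
The return definition used above can be illustrated with a few made-up prices in base R. This is only a sketch: diff(log(prices)) gives the same numbers as log(x) - lag(log(x)), except that dplyr's lag() keeps a leading NA for the first day while diff() drops it.

```r
prices <- c(100, 110, 99, 121)  # hypothetical adjusted closing prices

# Continuously compounded (log) returns: r_t = log(P_t) - log(P_{t-1})
log_returns <- diff(log(prices))

# Simple (discrete) returns for comparison: R_t = P_t / P_{t-1} - 1
simple_returns <- prices[-1] / prices[-length(prices)] - 1

# Log returns add up over time: their sum is the log of the total price change
total_log_return <- sum(log_returns)
all.equal(total_log_return, log(prices[4] / prices[1]))

# The two return definitions are linked by r_t = log(1 + R_t)
all.equal(log_returns, log(1 + simple_returns))
```

The additivity over time is the main practical reason for working with log returns; for small daily moves, the two definitions are numerically very close.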

In addition, we scrape tweet data from Elon Musk’s and Tesla’s official Twitter accounts with the rtweet package. Unfortunately, only the most recent 3’212 tweets per user are available, because Twitter limits free access to historical data in order to offer it commercially instead. Tweet scraping requires a Twitter account and a developer registration for the free Twitter API. This is fairly easy to set up, however, and should only take a couple of minutes (especially if you already have a Twitter account).

df_tweets_elon_musk <- get_timeline("elonmusk", n = 5000)

df_tweets_tesla     <- get_timeline("Tesla", n = 5000)

The tweets data set is rather wide with 90 columns. Thus, only a subset of the columns is shown here to give an idea of what the data set for Elon Musk’s tweets looks like:

df_tweets_elon_musk %>%
    select(created_at, screen_name, text, source,
           is_retweet, favorite_count, retweet_count, hashtags) %>%
    mutate(across(where(is.numeric), ~ round(., digits = 3))) %>%
    datatable(filter  = "top",
              options = list(pageLength = 5,
                             autoWidth  = F))

…and Tesla’s official Twitter account:

df_tweets_tesla %>%
    select(created_at, screen_name, text, source,
           is_retweet, favorite_count, retweet_count, hashtags) %>%
    mutate(across(where(is.numeric), ~ round(., digits = 3))) %>%
    datatable(filter = "top",
              options = list(pageLength = 5,
                             autoWidth  = F))

3 Our First Plot - Time Series of Tesla’s Stock Price

Now we’re ready to take the Tesla stock price data and create a basic ggplot2 time series chart. We need the above-mentioned Grammar of Graphics to set up each component of the plot. First, we map the data to so-called aesthetics. Aesthetics are defined within the aes() function in ggplot2 and specify, for example, what goes on the x-axis and y-axis, what is shown in which colour, and how the size of an object is determined. For our basic time series plot, we simply map the Date column from the stock data to the x-axis and the Adjusted stock price to the y-axis. The only additional component needed for a finished plot is a so-called geom (short for geometric object). Geoms determine the kind of plot we want to display and are added with the set of geom_...() functions. Here, we’d like to create a simple line plot with geom_line(). We add the geom as a new component to the plot with the + operator, save the plot to a new R object, and have our first plot.

p_basic_time_series_Tesla <- ggplot(data = df_Tesla_stock_data,
                                    aes(x = Date, y = Adjusted)) +  # Close
    geom_line()

p_basic_time_series_Tesla

3.1 Exercise 2: Create a time series plot for Apple’s stock price. You can also try to adjust the axis scales, in case you have any idea how to do it.

# df_Apple_data %>%
#     ggplot(aes(x = Date, y = Adjusted)) +
#     geom_line() +
#     scale_x_date(date_breaks = "1 year",
#                  date_labels = "%Y") +
#     scale_y_continuous(labels = scales::dollar,
#                        breaks = seq(from = 0, to = max(df_Apple_data$Adjusted, na.rm = T), by = 20))

So far, so good. This is what we get with ggplot2’s default settings. However, the plot doesn’t look particularly great, does it? The grey background is rather irritating, the date on the x-axis is only displayed every five years, it’s unclear in what units the y-axis is measured, and there’s no title or anything else to indicate what exactly is shown. The only information we have is the evolution of the series over a period of 10 years and its corresponding values on the y-axis. We need to adjust some basic components of the plot.

For a visual overview and corresponding explanations of the different components in ggplot2’s Grammar of Graphics, see this Towards Data Science article:

Since we have already defined our data and aesthetics components, we start by adjusting the scales of the x- and y-axes in a new component, the scales component. This ensures we get proper units and labels for both axes. We take the plot from above and add scale_x_...() and scale_y_...() functions with proper arguments.

p_basic_time_series_Tesla_w_scales <- p_basic_time_series_Tesla +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::dollar,
                       breaks = seq(from = 0, to = max(df_Tesla_stock_data$Adjusted, na.rm = T), by = 100))

p_basic_time_series_Tesla_w_scales

The theme of a plot is yet another component in the Grammar of Graphics. Setting a beautiful theme will help us to get rid of the irritating grey background. Let’s try the theme_classic() function.

p_basic_time_series_Tesla_w_scales_and_theme <- p_basic_time_series_Tesla_w_scales +
    theme_classic()

p_basic_time_series_Tesla_w_scales_and_theme

3.2 Exercise 3: Create the plot with the same theme and appropriately adjusted scales for Apple. Try adding a proper title to the plot.

# p_time_series_Apple <- df_Apple_data %>%
#     ggplot(aes(x = Date, y = Adjusted)) +
#     geom_line() +
#     scale_x_date(date_breaks = "1 year",
#                  date_labels = "%Y") +
#     scale_y_continuous(labels = scales::dollar,
#                        breaks = seq(from = 0, to = max(df_Apple_data$Adjusted, na.rm = T), by = 20)) +
#     theme_classic() +
#     labs(title    = "A Story of Success (and Steve Jobs)",
#          subtitle = "Apple's Stock Price",
#          y        = "Close (Adjusted)",
#          caption  = "© Data Science & Technology Club HSG")
#
# p_time_series_Apple

theme_classic() is a clean and minimalist theme. For interpreting a time series plot, however, a theme that includes a grid may be more appropriate. Thus, in the following plots, we use theme_light() instead. We keep the grid lines in the background by drawing them very thin, since they are only meant to help the viewer read values off the axes. Next, we would also like to add a proper title. Main titles, subtitles, and axis labels are set with the labs() function. In addition, we accentuate the x- and y-axes by drawing them thicker than the background grid lines. Let’s also adjust the label of the y-axis to make clearer what it represents. Finally, we add a caption with a copyright notice for the plot. Now we have our first complete time series plot.

p_basic_time_series_Tesla_w_scales +
    theme_light() +
    theme(plot.title       = element_text(face = "bold"),
          axis.line        = element_line(size = 0.75),  # thicker axes
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05)) +
    labs(title    = "Rising Higher and Higher...",
         subtitle = "Tesla Stock Price",
         y        = "Close (Adjusted)",
         caption  = "© Data Science & Technology Club HSG")

For the following plots, let’s set a global default ggplot2 theme, instead of adding it manually to each plot.

theme_set(theme_light())

To improve our plot further, we can add a so-called benchmark to it. A benchmark is, e.g., another time series to compare the Tesla stock price to. We use the previously gathered S&P 500 prices to do exactly that. To compare the series on a common y-axis, some data wrangling and rebasing is required: we divide each series by its first observation, so that every series starts at 1 (i.e. 100%). While the S&P 500 is a sensible measure of the broad U.S. stock market to compare Tesla to, one could argue that Tesla is more of a technology company and thus should rather be compared to the NASDAQ index. Hence, we also include the NASDAQ index as a benchmark and add text labels and end points for the different time series.

df_Tesla_SPY_NASDAQ <- df_Tesla_SPY_NASDAQ %>%
    mutate(AdjustedTSLARebased    = AdjustedTSLA / first(df_Tesla_stock_data$Adjusted),
           AdjustedSPYRebased     = AdjustedSPY / first(df_SPY_data$Adjusted),
           AdjustedNASDAQRebased  = AdjustedNASDAQ / first(df_NASDAQ_data$AdjustedNASDAQ))
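
The rebasing step can be sketched in isolation with two made-up price series on very different levels (the tsla and spy vectors below are toy values, not the downloaded data):

```r
# Hypothetical toy prices on very different levels
tsla <- c(4, 6, 5, 40)
spy  <- c(100, 105, 110, 120)

# Rebase: divide every price by the first price of its own series
tsla_rebased <- tsla / tsla[1]
spy_rebased  <- spy / spy[1]

tsla_rebased  # values: 1, 1.5, 1.25, 10
spy_rebased   # values: 1, 1.05, 1.1, 1.2
```

After rebasing, both series start at 1 and their values read as multiples of the starting price, which is what scales::percent later displays as percentages. Note that in the notebook each series is divided by its own first available observation, so the series share a common starting value but not necessarily a common starting date.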

p_time_series_Tesla_vs_SPY <- df_Tesla_SPY_NASDAQ %>%
    ggplot(aes(x = Date)) +
    geom_line(aes(y = AdjustedTSLARebased), col = col_palette_blue[6]) +
    geom_point(aes(x = last(Date),
                   y = last(AdjustedTSLARebased)),
               col   = col_palette_blue[6],
               shape = 1,
               size  = 1.5) +
    geom_text(label = "TSLA",
              aes(x = last(Date),
                  y = last(AdjustedTSLARebased)),
              color = col_palette_blue[6],
              hjust = 1.4,
              vjust = -1) +
    geom_line(aes(y = AdjustedSPYRebased), col = col_palette_green[7]) +
    geom_point(aes(x = last(Date),
                   y = last(AdjustedSPYRebased)),
               col   = col_palette_green[7],
               shape = 1,
               size  = 1.5) +
    geom_text(label = "S&P 500",
              aes(x = last(Date),
                  y = last(AdjustedSPYRebased)),
              color = col_palette_green[7],
              hjust = 1.4,
              vjust = -1) +
    geom_line(aes(y = AdjustedNASDAQRebased), col = col_palette_green[9]) +
    geom_point(aes(x = last(Date),
                   y = last(AdjustedNASDAQRebased)),
               col   = col_palette_green[9],
               shape = 1,
               size  = 1.5) +
    geom_text(label = "NASDAQ",
              aes(x = last(Date),
                  y = last(AdjustedNASDAQRebased)),
              color = col_palette_green[9],
              hjust = 1.4,
              vjust = -2) +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::percent,
                       breaks = seq(from = 0, to = 110, by = 10)) +
    labs(title    = "Is Tesla's Stock Price an Inflated Bubble - Close to Bursting?",
         subtitle = "Tesla's Stock Price vs. NASDAQ and S&P 500 Benchmarks",
         y        = "Price Rebased (%)",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text      = element_text(),
          plot.title       = element_text(face = "bold"),
          axis.line        = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

p_time_series_Tesla_vs_SPY

It is pretty impressive by how much Tesla’s stock price outperforms the (already well-performing) S&P 500. Beginning in mid-October 2019 in particular, the volatility of the stock increases immensely: a sharp rise is followed by a sharp decline and then another sharp rise. It remains questionable whether Tesla’s recent stock price appreciation is sustainable and warranted in the long run. Let’s highlight the period during which Tesla’s stock price increase was most notable in the chart. We can do this with the annotate() function. Highlighting areas or specific parts of a chart is a useful element of storytelling with data (engaging titles, proper labels, and colours are others).

p_time_series_Tesla_vs_SPY +
    annotate(geom  = "rect",
             xmin  = as.Date("2019-10-15"),
             xmax  = last(df_Tesla_SPY_NASDAQ$Date) + 35,
             ymin  = -Inf,
             ymax  = Inf,
             col   = "grey",
             alpha = 0.05) +
    annotate(geom  = "text",
             label = "High Volatility Period",
             x     = as.Date("2020-04-15"),
             y     = -3,
             col   = col_palette_red[8],
             size  = 3)

3.3 Exercise 4: Add a title, subtitle, and some text or line annotations to your Apple chart.

# text_Apple <- "Nevertheless, \n some bumps \n occurred along \n the road"
#
# p_time_series_Apple +
#     labs(title    = "Apple Fared Pretty Well",
#          subtitle = "Apple's Stock Price Over 13 Years",
#          y        = "Stock Price") +
#     geom_text(x     = as.Date("2019-01-15"),
#               y     = 80,
#               label = text_Apple,
#               size  = 3)

4 Our Second Plot - Scatter Plot

Next, we turn to one of the most basic, but also most useful plots - the scatter plot. Simply put, the only adjustment we need to make to the previous plots is to change the geom. As previously stated, the geom_…() functions determine the kind of plot we create. Beyond that, we need to choose what goes on the x-axis and what on the y-axis, or in other words, what we want to compare. We map Tesla’s stock returns to the y-axis and the NASDAQ returns to the x-axis, so we can see whether there’s any association between the two. First, however, we compute mean returns for all three financial instruments.

df_Tesla_SPY_NASDAQ_avg <- df_Tesla_SPY_NASDAQ %>%
    summarise(SPY_mean    = mean(ReturnsSPY, na.rm = T),
              TSLA_mean   = mean(ReturnsTSLA, na.rm = T),
              NASDAQ_mean = mean(ReturnsNASDAQ, na.rm = T))

geom_point() would be the go-to option for standard scatter plots. However, we use geom_jitter() instead, since it adds a small amount of random noise to the position of each observation in order to reduce overplotting, which makes individual points easier to see. As previously mentioned, returns of the NASDAQ go on the x-axis and returns of Tesla on the y-axis. We also highlight the most recent return in the data (labelled “Today” in the plot) to see where it stands in comparison to historical returns. The if_else() function is pretty handy for this purpose.

p_scatter_Tesla_NASDAQ <- df_Tesla_SPY_NASDAQ %>%
    ggplot(aes(x = ReturnsNASDAQ, y = ReturnsTSLA)) +
    geom_jitter(aes(col = if_else(Date == max(Date, na.rm = T),
                                  "Today",
                                  "Historical")),
                alpha = 0.5) +  # geom_point()
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    scale_color_manual(name   = "Date",
                       values = c(col_palette_blue[6], col_palette_red[7])) +
    labs(title    = "Is Tesla Related to the Broad Technology Market Index?",
         subtitle = "NASDAQ vs. TSLA Returns",
         x        = "NASDAQ Returns (Continuous)",
         y        = "TSLA Returns (Continuous)",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text      = element_text(),
          plot.title       = element_text(face = "bold"),
          axis.line        = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

p_scatter_Tesla_NASDAQ
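
The highlighting logic inside aes() can be tried out on its own. The following sketch uses a three-row toy tibble (df_toy and its values are made up) and stores the label in a column instead of computing it inside the plot call:

```r
library(dplyr)

# Hypothetical toy returns
df_toy <- tibble(Date    = as.Date(c("2020-06-01", "2020-06-02", "2020-06-03")),
                 Returns = c(0.01, -0.02, 0.03))

# Flag the most recent observation, label everything else as historical
df_toy <- df_toy %>%
    mutate(Highlight = if_else(Date == max(Date, na.rm = TRUE),
                               "Today",
                               "Historical"))

df_toy$Highlight  # "Historical" "Historical" "Today"
```

One possible refinement, not used in the plot above: passing a named vector such as c("Historical" = ..., "Today" = ...) to the values argument of scale_color_manual() ties each colour to its label explicitly instead of relying on the alphabetical order of the labels.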

Scatter plots are great for analysing the relationship between two (continuous) variables and are probably among the most used charts in research and ML contexts. To check whether a linear relationship between the returns of the NASDAQ and Tesla exists, we can add a regression line with geom_smooth(). The method argument is set to "lm" for linear model. geom_smooth() automatically adds confidence bands, which is pretty handy.

p_scatter_Tesla_NASDAQ +
    geom_smooth(method  = "lm",
                formula = "y ~ x",
                col     = col_palette_red[7])

In case one needs the slope coefficient of the linear regression as well, we can add this to the plot by using text annotations (after having computed the coefficients of the linear regression model).

model_lm_Tesla_NASDAQ <- df_Tesla_SPY_NASDAQ %>%
    lm(ReturnsTSLA ~ ReturnsNASDAQ,
       data = .)

model_lm_Tesla_NASDAQ_coefs <- model_lm_Tesla_NASDAQ %>%
    coef() %>%
    round(digits = 3)

text_model_lm_Tesla_NASDAQ_coefs <- paste("Alpha:", model_lm_Tesla_NASDAQ_coefs[1], "\n",
                                          "Beta:",  model_lm_Tesla_NASDAQ_coefs[2],
                                          sep = " ")

p_scatter_Tesla_NASDAQ +
    geom_smooth(method  = "lm",
                formula = "y ~ x",
                col     = col_palette_red[7]) +
    annotate(geom  = "text",
             label = text_model_lm_Tesla_NASDAQ_coefs,
             x     = 0.08,
             y     = -0.15,
             col   = col_palette_red[7],
             size  = 3)
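
To make the meaning of the two coefficients concrete, here is a sketch on simulated data where the true values are known. The return series are random numbers generated for illustration only; the intercept (alpha) is set to 0.001 and the slope (beta) to 1.5.

```r
set.seed(42)  # reproducible simulated data

# Hypothetical returns: the "stock" moves 1.5 times as much as the "index"
index_returns <- rnorm(n = 500, mean = 0, sd = 0.01)
stock_returns <- 0.001 + 1.5 * index_returns + rnorm(n = 500, mean = 0, sd = 0.005)

model_toy <- lm(stock_returns ~ index_returns)

coefs_toy <- coef(model_toy)

alpha_toy <- unname(coefs_toy[1])  # intercept: average return when the index return is zero
beta_toy  <- unname(coefs_toy[2])  # slope: sensitivity of the stock to the index

round(c(alpha = alpha_toy, beta = beta_toy), digits = 3)
```

The estimates should land close to, but not exactly on, the values used in the simulation. With real returns the true values are of course unknown, which is why the confidence bands in the plot matter.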

Did you notice how the regression line immediately became the center of our attention? This is due to the colour choice it’s mapped to in relation to other elements in the plot. Ideally, we use colours in a restrictive way to highlight specific and particularly important aspects in our visualisations.

Looking at the scatter plot and the dispersion of points, however, it is doubtful whether the relationship is truly linear. Thus, we can try another method in geom_smooth(), such as loess (local polynomial regression fitting). loess fits the data locally and therefore produces a curved rather than a straight regression line.

p_scatter_Tesla_NASDAQ +
    geom_smooth(method  = "loess",
                formula = "y ~ x",
                col     = col_palette_red[7])
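
The difference between the two smoothing methods can be sketched outside of ggplot2 with base R's lm() and loess() on toy data that is curved by construction (a sine wave, chosen purely for illustration):

```r
# Hypothetical toy data with a clearly curved relationship
x <- seq(from = 0, to = 2 * pi, length.out = 100)
y <- sin(x)

fit_lm    <- lm(y ~ x)
fit_loess <- loess(y ~ x)  # defaults: local quadratic fits, span = 0.75

# Residual sum of squares: smaller means the fitted curve is closer to the data
rss_lm    <- sum(residuals(fit_lm)^2)
rss_loess <- sum(residuals(fit_loess)^2)

c(lm = rss_lm, loess = rss_loess)
```

On data like this the local fit follows the curve far better than a straight line. On noisy return data the gain is much less clear-cut, and a flexible fit can also end up chasing noise.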

Getting back to the relationship between NASDAQ and Tesla returns: from these plots alone, it remains unclear what the true relationship is. All we can say is that Tesla on average seems to perform better when the NASDAQ also performs well. However, the more extreme the returns are, the more uncertainty there is about the relationship, as indicated by the wider confidence bands. This is due to the comparatively few observations we have for extreme returns.

4.1 Exercise 5: Create a scatter plot (e.g. with Apple stock returns and trading volumes) and fit a linear or non-linear model to gauge the relationship among the variables.

# df_Apple_data %>%
#     mutate(Return = log(Adjusted) - lag(log(Adjusted))) %>%
#     ggplot(aes(x = Return, y = Volume)) +
#     geom_jitter(alpha = 0.5) +
#     geom_smooth(method = "loess",
#                 formula  = "y ~ x",
#                 col    = col_palette_red[7]) +
#     scale_x_continuous(labels = scales::percent) +
#     scale_y_continuous(labels = scales::comma) +
#     labs(title    = "Non-Linear Relation",
#          subtitle = "Apple Stock Returns and Trading Volume")

5 Our Third Plot - Bar Chart of Tesla’s Stock Volume

To quickly demonstrate how to build a bar chart, we use the Volume variable in Tesla’s stock data. A bar chart can be created with geom_col(). In addition, we use colour highlighting to draw the viewers’ attention to certain aspects of the plot. Again, the if_else() function is ideal for this purpose.

p_bar_Tesla_stock_volume <- df_Tesla_stock_data %>%
    ggplot(aes(x = Date, y = Volume)) +
    geom_col(aes(fill = if_else(between(Date,
                                        as.Date("2013-05-09"),
                                        as.Date("2014-01-01")),
                                "Steep Increase",
                                "Normal"))) +
    labs(title    = "Trading Volume in Tesla Has Increased Greatly, Starting in Mid 2013...",
         subtitle = "Tesla Trading Volume",
         caption  = "© Data Science & Technology Club HSG") +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::comma,
                       breaks = seq(from = 0, to = max(df_Tesla_stock_data$Volume), by = 50e6)) +
    scale_fill_manual(name   = "Volume Level",
                      values = c("Normal"         = "grey36",
                                 "Steep Increase" = col_palette_red[7])) +
    theme(legend.text      = element_text(),
          plot.title       = element_text(face = "bold"),
          axis.line        = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

p_bar_Tesla_stock_volume

We can play with the width argument in geom_col() to adjust the width of the bars. If we need to highlight more than one time period, we can use case_when() instead of if_else(), which is basically a multi-option if-else statement.

p_bar_Tesla_stock_volume <- df_Tesla_stock_data %>%
    ggplot(aes(x = Date, y = Volume)) +
    geom_col(aes(fill = case_when(between(Date,
                                          as.Date("2013-05-09"),
                                          as.Date("2014-01-01")) ~ "Steep Increase",
                                  between(Date,
                                          as.Date("2020-01-01"),
                                          as.Date("2020-04-01")) ~ "Steep Increase",
                                  TRUE                           ~ "Normal")),
             width = 0.3) +
    labs(title    = "...and Again in Early 2020",
         subtitle = "Tesla Trading Volume",
         caption  = "© Data Science & Technology Club HSG") +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::comma,
                       breaks = seq(from = 0, to = max(df_Tesla_stock_data$Volume), by = 50e6)) +
    scale_fill_manual(name   = "Volume Level",
                      values = c("Normal"         = "grey36",
                                 "Steep Increase" = col_palette_red[7])) +
    theme(legend.text      = element_text(),
          plot.title       = element_text(face = "bold"),
          axis.line        = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

p_bar_Tesla_stock_volume
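
How case_when() assigns the labels can be checked on a plain vector of dates before using it inside aes(). In this sketch the three dates are made up, while the two highlighted periods are the ones from the plot above:

```r
library(dplyr)

dates <- as.Date(c("2013-06-03", "2016-03-01", "2020-02-03"))  # hypothetical dates

# Conditions are checked from top to bottom; the first match wins,
# and TRUE acts as the catch-all for everything that did not match
volume_level <- case_when(
    between(dates, as.Date("2013-05-09"), as.Date("2014-01-01")) ~ "Steep Increase",
    between(dates, as.Date("2020-01-01"), as.Date("2020-04-01")) ~ "Steep Increase",
    TRUE                                                         ~ "Normal"
)

volume_level  # "Steep Increase" "Normal" "Steep Increase"
```

Giving both periods the same label keeps the legend at two entries; using a different label per period would instead require one colour per label in scale_fill_manual().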

5.1 Exercise 6: Create a bar plot (e.g. with Apple’s trading volume). Try to highlight some aspects of the bar plot.

# p_bar_plot_Apple_volume <- df_Apple_data %>%
#     ggplot(aes(x = Date, y = Volume)) +
#     geom_col() +
#     annotate(geom  = "segment",
#              x     = as.Date("2009-01-01"),
#              xend  = as.Date("2015-01-01"),
#              y     = 3e9,
#              yend  = 1e9,
#              arrow = grid::arrow(length = unit(0.25, "cm")),
#              col   = col_palette_red[7]) +
#     annotate(geom  = "text",
#              label = "Decline",
#              x     = as.Date("2012-01-01"),
#              y     = 2.3e9,
#              col   = col_palette_red[7]) +
#     labs(title    = "Has Interest in Apple Gone Down?",
#          subtitle = "Apple Stock Trading Volume",
#          caption  = "© Data Science & Technology Club HSG") +
#     scale_x_date(date_breaks = "1 year",
#                  date_labels = "%Y") +
#     scale_y_continuous(labels = scales::comma,
#                        breaks = seq(from = 0, to = max(df_Apple_data$Volume), by = 5e8)) +
#     theme(legend.text      = element_text(),
#           plot.title       = element_text(face = "bold"),
#           axis.line        = element_line(size = 0.75),
#           panel.grid.major = element_line(size = 0.05),
#           panel.grid.minor = element_line(size = 0.05))
#
# p_bar_plot_Apple_volume

6 Histogram - Tesla Stock Returns

Histograms are an ideal tool for analysing the univariate distribution of a variable of interest. To visualise the distribution of Tesla’s stock returns over time, we thus create a histogram with geom_histogram().

p_hist_Tesla <- df_Tesla_SPY_NASDAQ %>%
    ggplot(aes(x = ReturnsTSLA)) +
    geom_histogram(bins  = 200,
                   col   = "white",
                   fill  = col_palette_blue[6],
                   alpha = 0.85) +
    labs(title    = "Is There Money to Make by Investing in Tesla Stock?",
         subtitle = "Histogram with Distribution of Tesla's Stock Returns",
         x        = "Continuous Returns",
         y        = "Count") +
    scale_x_continuous(label = scales::percent) +
    theme(legend.text      = element_text(),
          plot.title       = element_text(face = "bold"),
          axis.line        = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

p_hist_Tesla
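
The x-axis above is labelled "Continuous Returns". Assuming these are log returns, as the commented-out `log(Adjusted) - lag(log(Adjusted))` further down in this notebook suggests, the following minimal sketch shows how such returns can be computed from a price series and how they compare to simple returns. The price vector is made up for illustration and is not Tesla data.

```r
# Toy closing prices (hypothetical values, not Tesla data)
prices <- c(100, 105, 102.9, 108.045)

# Continuous (log) returns: difference of consecutive log prices
returns_log <- diff(log(prices))

# Simple returns for comparison: P_t / P_(t-1) - 1
returns_simple <- prices[-1] / prices[-length(prices)] - 1

# Log returns add up over time: their sum equals the log of the total price change
total_log_return <- sum(returns_log)

data.frame(returns_simple, returns_log)
```

For small price changes both definitions give almost the same number; log returns are simply more convenient to aggregate over time.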

Then, we add a density curve to the histogram with geom_density(). We also have to change the y-axis from count to density in geom_histogram() so that the histogram and the density curve share the same axis scale.

p_hist_Tesla <- df_Tesla_SPY_NASDAQ %>%
    ggplot(aes(x = ReturnsTSLA)) +
    geom_histogram(aes(y = ..density..),
                   bins  = 200,
                   col   = "white",
                   fill  = col_palette_blue[6],
                   alpha = 0.85) +
    geom_density(kernel = "gaussian",
                 col    = col_palette_blue[6]) +
    labs(title    = "Is There Money to Make by Investing in Tesla Stock?",
         subtitle = "Histogram with Distribution of Tesla's Stock Returns",
         x        = "Continuous Returns",
         y        = "Density") +
    scale_x_continuous(label = scales::percent) +
    theme(legend.text      = element_text(),
          plot.title       = element_text(face = "bold"),
          axis.line        = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

p_hist_Tesla
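
To see why the y-axis has to switch from counts to densities, a small base R sketch may help. It uses simulated numbers as a stand-in for the return series (an assumption made purely for illustration): on the density scale, bar height times bin width sums to one, which is the same scale a density curve lives on.

```r
set.seed(42)
x <- rnorm(1000, mean = 0, sd = 0.03)  # simulated stand-in for daily returns

# Compute the histogram without drawing it
h <- hist(x, breaks = 50, plot = FALSE)

bin_widths <- diff(h$breaks)

# Counts sum to the number of observations ...
total_count <- sum(h$counts)

# ... whereas density times bin width sums to 1, like the area under a density curve
total_area <- sum(h$density * bin_widths)

c(total_count = total_count, total_area = total_area)
```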

Next, we add the mean and median return over time. First, we compute both metrics from the data; then, we add them to the plot with the geom_vline() function (vline for vertical line). We also use our own manual colour scale with scale_color_manual().

Tesla_returns_mean   <- mean(df_Tesla_SPY_NASDAQ$ReturnsTSLA, na.rm = T)

Tesla_returns_median <- median(df_Tesla_SPY_NASDAQ$ReturnsTSLA, na.rm = T)

p_hist_Tesla <- p_hist_Tesla +
    geom_vline(aes(xintercept = Tesla_returns_mean,
                   col        = "Mean"),
               size = 1) +
    geom_vline(aes(xintercept = Tesla_returns_median,
                   col        = "Median"),
               size = 1) +
    scale_color_manual(name   = "Metric",
                       values = c("Mean"   = col_palette_red[7],
                                  "Median" = col_palette_green[7]))

p_hist_Tesla

We can also add text labels with the numerical values of mean and median return. We use the annotate() function for this and first prepare the text to be displayed in two separate objects. Additionally, we make sure the colour scale used in the text annotations matches the colours used in the plot itself.

text_Tesla_returns_mean <- paste("Mean Return: \n",
                                 round(Tesla_returns_mean * 100, digits = 3),
                                 "%",
                                 sep = "")

text_Tesla_returns_median <- paste("Median Return: \n",
                                   round(Tesla_returns_median * 100, digits = 3),
                                   "%",
                                   sep = "")

p_hist_Tesla +
    annotate(geom  = "text",
             x     = 0.15,
             y     = 15,
             label = text_Tesla_returns_mean,
             col   = col_palette_red[7],
             size  = 3) +
    annotate(geom  = "text",
             x     = 0.15,
             y     = 13,
             label = text_Tesla_returns_median,
             col   = col_palette_green[7],
             size  = 3)

# Determine y-axis density position of median, mean, and confidence intervals
#
# p_hist_Tesla <- df_Tesla_SPY_NASDAQ %>%
#     ggplot(aes(x = ReturnsTSLA)) +
#     stat_density(aes(y = ..scaled..),
#                  geom   = "line",
#                  size   = 0.5,
#                  col    = col_palette_blue[6],
#                  adjust = 1) +
#     labs(title = "Histogram - Tesla Stock Returns",
#          x     = "Continuous Returns",
#          y     = "Count") +
#     scale_x_continuous(label = scales::percent) +
#     theme(legend.text = element_text(),
#           plot.title  = element_text(face = "bold"),
#           axis.line   = element_line(size = 0.75),
#           panel.grid.major = element_line(size = 0.05),
#           panel.grid.minor = element_line(size = 0.05))
#
# mean_se <- sd(df_Tesla_SPY_NASDAQ$ReturnsTSLA, na.rm = T) / sqrt(length(df_Tesla_SPY_NASDAQ$ReturnsTSLA))
#
# mean_conf_inter_l <- Tesla_returns_mean - 1.96 * mean_se
#
# mean_conf_inter_u <- Tesla_returns_mean + 1.96 * mean_se
#
# mean_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
#     slice(which.min(abs(x - Tesla_returns_mean))) %>%
#     pull(ndensity)
#
# mean_conf_inter_l_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
#     slice(which.min(abs(x - mean_conf_inter_l))) %>%
#     pull(ndensity)
#
# mean_conf_inter_u_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
#     slice(which.min(abs(x - mean_conf_inter_u))) %>%
#     pull(ndensity)
#
# p_hist_Tesla +
#     geom_segment(x = Tesla_returns_mean,
#                  xend = Tesla_returns_mean,
#                  y = 0,
#                  yend = mean_pos_y,
#                  linetype = "solid",
#                  color = col_palette_blue[6],
#                  size = 0.4) +
#     geom_point(x = Tesla_returns_mean,
#                y = mean_pos_y,
#                col = col_palette_blue[6])
    # geom_area(x = mean_conf_inter_l,
    #              xend = mean_conf_inter_u,
    #              y = mean_conf_inter_l_pos_y,
    #              yend = mean_conf_inter_u_pos_y,
    #              linetype = "solid",
    #              color = "grey",
    #              size = 0.4)

6.1 Exercise 7: Create a histogram (e.g. with Apple’s trading volumes) and add the mean or median as vertical lines.

# df_Apple_data %>%
#     ggplot(aes(x = Volume)) +
#     geom_histogram(aes(y = ..density..),
#                    bins  = 200,
#                    col   = "white",
#                    fill  = col_palette_blue[6],
#                    alpha = 0.85) +
#     geom_density(kernel = "gaussian",
#                  col    = col_palette_blue[6]) +
#     geom_vline(aes(xintercept = mean(df_Apple_data$Volume, na.rm = T),
#                    col = "Mean"),
#                size = 1) +
#     geom_vline(aes(xintercept = median(df_Apple_data$Volume, na.rm = T),
#                    col = "Median"),
#                size = 1) +
#     scale_x_continuous(labels = scales::dollar,
#                        breaks = seq(from = 0, to = max(df_Apple_data$Volume, na.rm = T), by = 5e8)) +
#     scale_color_manual(name   = "Metric",
#                        values = c("Mean"   = col_palette_red[7],
#                                   "Median" = col_palette_green[7])) +
#     labs(title    = "What is the Usual Dollar Trading Volume in Apple?",
#          subtitle = "Apple Stock Trading Volume Over Time",
#          x        = "Trading Volume in USD",
#          y        = "Density") +
#     theme(legend.text      = element_text(),
#           plot.title       = element_text(face = "bold"),
#           axis.line        = element_line(size = 0.75),
#           panel.grid.major = element_line(size = 0.05),
#           panel.grid.minor = element_line(size = 0.05))

7 Faceted Time Series Plot

If we want to display multiple series in a single plot, this is best done by using the ggplot2 facets component. It is applied as a separate component in our already existing time series plot, by adding either facet_wrap() or facet_grid(). The rest of the plot is coded the same way as for a univariate plot. facet_wrap() or facet_grid() then takes a variable that distinguishes the series from each other. Faceted plots are a great way to directly compare a number of similar data series (within the same sort of plot design) and they are not limited to time series graphs. It usually helps if the various series have roughly the same axis scales and are shown in different colours. Before we can build a facet plot, however, some data wrangling is required to transform the data from wide to long format.

df_Tesla_stock_data_long <- df_Tesla_stock_data %>%
    select(-Volume) %>%
    pivot_longer(cols      = -Date,
                 names_to  = "Variable",
                 values_to = "Values")

p_time_series_Tesla_faceted <- df_Tesla_stock_data_long %>%
    ggplot(aes(x = Date, y = Values, col = Variable)) +
    geom_line() +
    facet_wrap(. ~ Variable) +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::dollar,
                       breaks = seq(from = 0, to = max(df_Tesla_stock_data_long$Values, na.rm = T), by = 100)) +
    scale_color_viridis(discrete = T) +
    labs(title   = "Faceted Stock Price Time Series - Tesla",
         y       = "Stock Price",
         caption = "© Data Science & Technology Club HSG") +
    theme(axis.text.x = element_text(angle = 60,
                                     vjust = 0.5),
          panel.grid.major = element_line(size = 0.1),
          panel.grid.minor = element_line(size = 0.05))

p_time_series_Tesla_faceted
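
The wide-to-long step is easiest to follow on a tiny example. The sketch below uses a made-up two-row data frame (the column names mirror the stock data, the values are hypothetical) and shows how pivot_longer() stacks the price columns into one name column and one value column.

```r
library(tidyr)

# Toy wide-format data: one row per date, one column per price series (made-up values)
df_wide <- data.frame(Date  = as.Date(c("2020-01-02", "2020-01-03")),
                      Open  = c(84.9, 88.1),
                      Close = c(86.1, 88.6))

# Long format: one row per date and series, which is the layout facet_wrap() works with
df_long <- df_wide %>%
    pivot_longer(cols      = -Date,
                 names_to  = "Variable",
                 values_to = "Values")

df_long
```

Each date now appears once per series, and the Variable column is what the facet call splits on.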

7.1 Exercise 8: Create a facet plot with Apple’s stock data. Remember that it’s easiest if you transform the data first into long format.

# df_Apple_data_long <- df_Apple_data %>%
#     mutate(Volume = Volume / 1e8) %>%
#     pivot_longer(cols      = -Date,
#                  names_to  = "Variable",
#                  values_to = "Values")
#
# p_time_series_Apple_faceted <- df_Apple_data_long %>%
#     ggplot(aes(x = Date, y = Values, col = Variable)) +
#     geom_line() +
#     facet_wrap(. ~ Variable) +
#     scale_x_date(date_breaks = "1 year",
#                  date_labels = "%Y") +
#     scale_y_continuous(labels = scales::dollar,
#                        breaks = seq(from = 0, to = max(df_Apple_data_long$Values, na.rm = T), by = 20)) +
#     scale_color_viridis(discrete = T) +
#     labs(title   = "Faceted Stock Price and Trading Volume Time Series - Apple",
#          y       = "Stock Price or Dollar Trading Volume",
#          caption = "© Data Science & Technology Club HSG") +
#     theme(axis.text.x = element_text(angle = 60,
#                                      vjust = 0.5),
#           panel.grid.major = element_line(size = 0.1),
#           panel.grid.minor = element_line(size = 0.05))
#
# p_time_series_Apple_faceted

8 Interactive Plots with Plotly

To add some more spice to the previously built plots, we can turn them into interactive web graphs. This is where the plotly package comes into play. It is built around the plotly.js (JavaScript) library and is extremely useful and versatile when it comes to interactive plots used in reports, dashboards or web pages. Furthermore, once we have created a plot and saved it as a ggplot2 object, we can simply call the ggplotly() function on this plot - and voilà - we have an interactive graph!

p_time_series_Tesla_faceted %>%
    ggplotly()
# FIXME: Annotation doesn't work yet

p_time_series_Tesla_vs_SPY <- p_time_series_Tesla_vs_SPY +
    theme_classic()

The ggplotly() function works with most sorts of plots created with ggplot2. With interactive plotly graphs, we can zoom in, hover over and highlight parts of the plot. The syntax for creating a plotly graph from scratch is slightly different from ggplot2, but nevertheless not too complicated. A range of additional options and plots are available by directly accessing the plotly interface.

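# Position the caption relative to the plot area ("paper") rather than in data coordinates
p_time_series_Tesla_vs_SPY <- p_time_series_Tesla_vs_SPY %>%
    ggplotly() %>%
    layout(annotations = list(x         = 1,
                              y         = 1,
                              xref      = "paper",
                              yref      = "paper",
                              showarrow = FALSE,
                              text      = "© Data Science & Technology Club HSG"))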

p_time_series_Tesla_vs_SPY

8.1 Exercise 9: Turn one of our previously built plots into a plotly graph (and play around with it).

# p_hist_Tesla %>%
#     ggplotly()

9 Elon Musk’s Tweets

So far, we have covered a range of plots that are useful for analysing numerical, structured data. We next turn to the gathered tweet data, which is unstructured text data. As such, it needs some data wrangling and transformation before we can use it in any insightful way. We start by computing the number of tweets Elon Musk writes per day and show summary statistics thereof. This allows us to see, e.g., the maximum number of tweets Musk wrote on a single day. Interestingly, in 2020, not a single day went by on which Musk did not tweet at all!

df_tweets_elon_musk_per_day <- df_tweets_elon_musk %>%
    mutate(Date = as.Date(created_at)) %>%
    group_by(Date) %>%
    summarise(TweetsN = n(),
              .groups = NULL)

df_tweets_elon_musk_per_day %>%
    summarise(Min            = min(TweetsN, na.rm = T),
              `1st Quartile` = quantile(TweetsN, probs = 0.25),
              Median         = median(TweetsN, na.rm = T),
              Mean           = round(mean(TweetsN, na.rm = T), digits = 2),
              `3rd Quartile` = quantile(TweetsN, probs = 0.75),
              Max            = max(TweetsN, na.rm = T)) %>%
    datatable(caption = htmltools::tags$caption(style = "caption-side: bottom; text-align: center;",
                                                "Table 1: ",
                                                htmltools::em("Summary statistics of daily tweets by Elon Musk.")))
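
One caveat when reading Table 1: days without any tweet produce no row in df_tweets_elon_musk_per_day, so the minimum of the daily counts can never be zero. A statement about days without tweets is therefore better checked against a complete calendar. The sketch below does this in base R on made-up dates; `tweet_dates` is a hypothetical stand-in for `as.Date(created_at)`.

```r
# Hypothetical tweet dates (stand-in for as.Date(created_at))
tweet_dates <- as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                         "2020-01-04", "2020-01-04", "2020-01-04"))

# Tweets per day: only days with at least one tweet show up here
tweets_per_day <- table(tweet_dates)

# Complete calendar over the observed period
all_days <- seq(from = min(tweet_dates), to = max(tweet_dates), by = "day")

# Days on which no tweet was written
days_without_tweets <- all_days[!all_days %in% tweet_dates]

days_without_tweets
```

If `days_without_tweets` is empty for a given period, every day in that period has at least one tweet.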

Next, we use our ggplot2 and plotly skills to create a bar plot for visualising the number of tweets per day.

p_bar_tweets_elon_musk <- df_tweets_elon_musk_per_day %>%
    ggplot(aes(x = Date, y = TweetsN, fill = TweetsN)) +
    geom_col() +
    scale_x_date(date_breaks = "1 month",
                 date_labels = "%Y %b") +
    scale_y_continuous(breaks = seq(0, 60, 10)) +
    labs(title = "How Many Tweets Does Elon Musk Write per Day?",
         x     = "Month",
         y     = "Number of Tweets") +
    scale_fill_binned(name = "Number of \nTweets by \nElon Musk", type = "viridis") +
    theme(legend.text      = element_text(),
          plot.title       = element_text(face = "bold"),
          axis.line        = element_line(size = 0.75),
          axis.text.x      = element_text(angle = 60,
                                          hjust = 1),
          panel.grid.major = element_line(size = 0.1),
          panel.grid.minor = element_line(size = 0.05))

p_bar_tweets_elon_musk <- p_bar_tweets_elon_musk %>%
    ggplotly()

p_bar_tweets_elon_musk

Now we get back to our main question and would like to compare the number of Tweets Musk writes per day to the evolution of Tesla’s stock price. In plotly, we can display two interactive plots together with subplot(). All objects called in subplot() need to be plotly graphs in order for this to work.

subplot(p_time_series_Tesla_vs_SPY,
        p_bar_tweets_elon_musk,
        nrows  = 2,
        shareX = T)

This is a first step in the right direction. It is, however, hard to see any direct influence of Musk’s tweets on Tesla’s stock price. To gauge whether the number of tweets by Musk per day is associated in any way with the returns of Tesla’s stock, we (naively) visualise this with a scatter plot first. We also add a third dimension via colour, which encodes the day on which the respective stock return was recorded.

df_Tesla_EM_tweets <- df_Tesla_SPY_NASDAQ %>%
    full_join(df_tweets_elon_musk_per_day,
              by = "Date") %>%
    select(Date, ReturnsTSLA, TweetsN)

p_scatter_Tesla_EM_tweets <- df_Tesla_EM_tweets %>%
    filter(!is.na(TweetsN)) %>%
    ggplot(aes(x = ReturnsTSLA, y = TweetsN)) +
    geom_jitter(aes(col = Date)) +
    geom_vline(xintercept = 0,
               size       = 1,
               alpha      = 0.1) +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(breaks = seq(0, max(df_tweets_elon_musk_per_day$TweetsN, na.rm = T), 10)) +
    scale_color_date(low  = col_palette_blue[7],
                     high = col_palette_green[7]) +
    labs(title    = "Number of Daily Tweets vs. TSLA Returns",
         subtitle = "Scatter Plot",
         x        = "TSLA Returns (Continuous)",
         y        = "Number of Tweets per Day",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75))

p_scatter_Tesla_EM_tweets %>%
    ggplotly()

The result looks pretty much like a random cloud of data points, so there hardly seems to be any association. To collect further evidence, we can add a (local polynomial) regression line to check for the association. As we already suspected from the simple scatter plot, however, no direct relationship is visible here.

p_scatter_Tesla_EM_tweets_reg <- p_scatter_Tesla_EM_tweets +
    geom_smooth(method  = "loess",
                formula = y ~ x,
                col     = col_palette_red[7])

p_scatter_Tesla_EM_tweets_reg
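
As a numerical complement to the smoothed line, the sketch below fits the same kind of local regression with stats::loess() and computes a rank correlation. The data are simulated without any built-in relationship (an assumption for illustration only), so the example shows the mechanics rather than the Tesla result.

```r
set.seed(1)
n <- 200
df_toy <- data.frame(returns = rnorm(n, mean = 0, sd = 0.04),  # simulated returns
                     tweets  = rpois(n, lambda = 10))          # simulated tweet counts

# Local polynomial regression of tweet counts on returns
fit_loess <- loess(tweets ~ returns, data = df_toy)

# Fitted curve on a grid within the observed range of returns
df_grid <- data.frame(returns = seq(from       = min(df_toy$returns),
                                    to         = max(df_toy$returns),
                                    length.out = 50))
df_grid$fitted <- predict(fit_loess, newdata = df_grid)

# Spearman rank correlation as a simple check for a monotone association
cor_spearman <- cor(df_toy$returns, df_toy$tweets, method = "spearman")
cor_spearman
```

A fitted curve that stays roughly flat together with a rank correlation near zero points in the same direction as the visual impression from the scatter plot.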

Another way to reach the same conclusion is with a boxplot. We thus produce a boxplot with the same underlying data as before. To do this, we need to sort the number of tweets into so-called “bins”. We choose 12 bins, which splits the number of tweets into bins with a width of approximately 5.

p_boxplot_Tesla_EM_tweets <- df_Tesla_EM_tweets %>%
    mutate(TweetsN = cut(TweetsN, breaks = 12)) %>%
    filter_all(~ !is.na(.)) %>%
    ggplot(aes(x = TweetsN, y = ReturnsTSLA)) +
    geom_boxplot(col = col_palette_blue[6]) +
    scale_y_continuous(labels = scales::percent) +
    scale_color_viridis_d() +
    labs(title    = "Number of Daily Tweets vs. TSLA Returns",
         subtitle = "Boxplot",
         x        = "Number of Tweets per Day",
         y        = "TSLA Returns (Continuous)",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75),
          axis.text.x = element_text(angle = 90,
                                     vjust = 1))

p_boxplot_Tesla_EM_tweets %>%
    ggplotly()
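
The binning step can be tried out on its own. In the sketch below, the tweet counts are hypothetical values between 1 and 60; passing a single number to the breaks argument of cut() splits the observed range into that many equal-width intervals.

```r
# Hypothetical daily tweet counts between 1 and 60
tweets_n <- c(1, 3, 7, 12, 18, 25, 33, 41, 52, 60)

# breaks = 12 splits the observed range into 12 equal-width intervals
tweets_binned <- cut(tweets_n, breaks = 12)

# Approximate bin width: range of the data divided by the number of bins
bin_width <- diff(range(tweets_n)) / 12

levels(tweets_binned)
table(tweets_binned)
bin_width
```

Empty bins stay in the factor levels, so they only disappear from a plot once rows with missing values or unused levels are dropped.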

We conclude that there is no clear association between the number of tweets per day and stock returns: the box plots show no rising or falling trend across the bins (note that the plot axes are swapped here compared with the scatter plot). Our naive comparison of daily tweet counts to stock returns therefore provides no evidence of an association.

# Get Tesla tweets

df_tweets_elon_musk_Tesla <- df_tweets_elon_musk %>%
    filter(str_detect(text, pattern = "Tesla"))

So now, let’s dive deeper and take a look at Musk’s infamous “taking-Tesla-private” tweet from 7 August 2018.

p_time_series_Tesla_private_tweet <- df_Tesla_stock_data %>%
    filter(between(Date,
                   as.Date("2018-07-01"),
                   as.Date("2018-09-14"))) %>%
    mutate(Date = as_datetime(Date, tz = "UTC")) %>%
    ggplot(aes(x = Date, y = Adjusted)) +
    geom_line(col = col_palette_blue[6]) +
    geom_point(col = col_palette_blue[6]) +
    geom_vline(xintercept = as_datetime("2018-08-07 12:48:00"),
               col        = col_palette_red[7],
               alpha      = 1)

    # annotate(geom  = "text",
    #          label = "High Volatility Period",
    #          x     = as.Date("2020-04-01"),
    #          y     = -3)

p_time_series_Tesla_private_tweet

# p_time_series_Tesla_private_tweet %>%
#     ggplotly()

We next tokenize the tweets into words. This enables us to quantitatively analyse them.

df_tweets_elon_musk_tokens <- df_tweets_elon_musk %>%
    unnest_tokens(output = words,
                  input  = text,
                  token  = "words")

When we count which words appear most often in the tweets, we see that they are common ones such as “to”, “the”, etc. These are known as stop words and it makes sense to remove them for a meaningful analysis.

df_tweets_elon_musk_tokens %>%
    count(words) %>%
    arrange(desc(n)) %>%
    datatable()

Let’s do just that and voilà - the most frequently used words in the tweets now make much more sense and we can actually start using them for further analysis.

stop_words_custom <- tribble(~ word,   ~ lexicon,
                             "http",  "CUSTOM",
                             "https", "CUSTOM",
                             "t.co",  "CUSTOM",
                             "amp",   "CUSTOM",
                             "it’s",  "CUSTOM")

stop_words_final <- stop_words %>%
    bind_rows(stop_words_custom)

df_tweets_elon_musk_tokens_cleaned <- df_tweets_elon_musk_tokens %>%
    anti_join(stop_words_final,
              by = c("words" = "word"))

df_tweets_elon_musk_tokens_cleaned %>%
    count(words) %>%
    arrange(desc(n)) %>%
    datatable()
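
The idea behind tokenising and removing stop words can be sketched in a few lines of base R. This is only a rough approximation of what unnest_tokens() and anti_join() do above; the two tweets and the short stop word list are made up for illustration.

```r
# Two made-up tweets (not from the data set)
tweets <- c("The Tesla team is great", "Launch of the rocket is tomorrow")

# Rough tokenisation: lower-case, then split on anything that is not a letter, digit or apostrophe
tokens <- unlist(strsplit(tolower(tweets), "[^a-z0-9']+"))
tokens <- tokens[tokens != ""]

# Tiny hypothetical stop word list (the notebook uses stop_words plus custom entries)
stop_words_toy <- c("the", "is", "of")

# Keep only tokens that are not stop words (the idea behind the anti_join())
tokens_cleaned <- tokens[!tokens %in% stop_words_toy]

# Word counts, most frequent first
sort(table(tokens_cleaned), decreasing = TRUE)
```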

We visualise the number of times the words occur in Musk’s tweets first with a simple flipped bar plot.

p_bar_flipped_tweet_word_count <- df_tweets_elon_musk_tokens_cleaned %>%
    count(words) %>%
    filter(n >= 70) %>%
    ggplot(aes(x = fct_reorder(words, n), y = n)) +
    geom_col(aes(fill = if_else(str_detect(words, pattern = "tesla"),
                                "red",
                                "blue")),
             alpha = 0.85) +
    geom_text(aes(y     = n + 10,
                  label = n),
              size = 2.5) +
    coord_flip() +
    scale_fill_manual(values = c("red" = col_palette_red[7], "blue" = col_palette_blue[6])) +
    labs(title    = "Tesla Seems Indeed to be Important for Elon Musk… (It's All He Talks about All-Day Long!)",
         subtitle = "Word Counts in Elon Musk's Tweets",
         x        = "Word",
         y        = "Word Counts",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.position = "none",
          plot.title      = element_text(face = "bold"),
          axis.line       = element_line(size = 0.75))

p_bar_flipped_tweet_word_count

In addition, we can also add some more labels and descriptions for highlighting purposes.

label_text <- "Words related \n to Tesla \n in Elon Musk's \n tweets"

# p_bar_flipped_tweet_word_count +
#     geom_curve(x         = "launch",
#                xend      = "erdayastronaut",
#                y         = 250,
#                yend      = 385,
#                curvature = 0.3,
#                arrow     = arrow(length = unit(0.2, "cm"),
#                                  type   = "closed"),
#                col       = "gray36") +
#     geom_curve(x         = "launch",
#                xend      = "thirdrowtesla",
#                y         = 250,
#                yend      = 135,
#                curvature = 0.2,
#                arrow     = arrow(length = unit(0.2, "cm"),
#                                  type   = "closed"),
#                col       = "gray36") +
#     geom_curve(x         = "launch",
#                xend      = "teslaownerssv",
#                y         = 250,
#                yend      = 115,
#                curvature = 0.15,
#                arrow     = arrow(length = unit(0.2, "cm"),
#                                  type   = "closed"),
#                col       = "gray36") +
#     geom_curve(x         = "launch",
#                xend      = "teslarati",
#                y         = 250,
#                yend      = 100,
#                curvature = 0.1,
#                arrow     = arrow(length = unit(0.2, "cm"),
#                                  type   = "closed"),
#                col       = "gray36") +
#     geom_label(x       = "sciguyspace",
#                y       = 250,
#                label   = label_text,
#                size    = 5,
#                label.r = unit(0, "cm"))

10 Wordcloud Plot - Elon Musk’s Tweets

Let’s visualise the same word counts as a word cloud, in which more frequent words are drawn larger.

df_tweets_elon_musk_tokens_cleaned %>%
    count(words) %>%
    filter(n >= 10) %>%
    arrange(desc(n)) %>%
    wordcloud2()

Star-shaped form

df_tweets_elon_musk_tokens_cleaned %>%
    count(words) %>%
    filter(n >= 10) %>%
    wordcloud2(shape = "star")

11 Sentiment Analysis - Elon Musk’s Tweets

To perform a sentiment analysis on the content of Elon Musk’s tweets, we match the tokenised words with the nrc sentiment dictionary and visualise the results.

df_tweets_elon_musk_tokens_sentiment_nrc <- df_tweets_elon_musk_tokens_cleaned %>%
    inner_join(get_sentiments("nrc"),
               by = c("words" = "word"))

df_tweets_elon_musk_tokens_sentiment_nrc_count <- df_tweets_elon_musk_tokens_sentiment_nrc %>%
    count(sentiment) %>%
    arrange(desc(n)) %>%
    mutate(colour = if_else(sentiment %in% c("positive", "trust", "anticipation", "joy"),
                            "green",
                            "red"))

p_bar_tweets_elon_musk_sentiment_nrc <- df_tweets_elon_musk_tokens_sentiment_nrc_count  %>%
    ggplot(aes(x = fct_reorder(sentiment, n), y = n, fill = colour)) +
    geom_col() +
    coord_flip() +
    scale_fill_manual(values = c("red" = col_palette_red[7], "green" = col_palette_green[7])) +
    labs(title    = "Musk Seems to Look on the Bright Side of Life…",
         subtitle = "Sentiment Analysis of Elon Musk's Tweets",
         x        = "Sentiment",
         y        = "Word Counts",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.position = "none",
          plot.title      = element_text(face = "bold"),
          axis.line       = element_line(size = 0.75))

p_bar_tweets_elon_musk_sentiment_nrc %>%
    ggplotly()
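
What the lexicon join does to the word counts is worth seeing on a toy example. The sketch below uses base R merge() as a stand-in for inner_join() and a tiny hypothetical lexicon in the same word/sentiment layout as get_sentiments("nrc"); none of the entries are taken from the real dictionary.

```r
# Tokens from a made-up tweet
df_tokens <- data.frame(words = c("love", "rocket", "failure", "love"))

# Tiny hypothetical lexicon; a word can carry more than one sentiment
df_lexicon <- data.frame(word      = c("love", "love", "failure"),
                         sentiment = c("joy", "positive", "negative"))

# Inner join: tokens without a lexicon entry ("rocket") are dropped,
# tokens with several sentiments appear once per sentiment
df_matched <- merge(df_tokens, df_lexicon, by.x = "words", by.y = "word")

sentiment_counts <- table(df_matched$sentiment)
sentiment_counts
```

The sentiment totals therefore count word-sentiment pairs rather than words, and words missing from the lexicon do not contribute at all.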

Let’s see whether there is an association between tweet sentiment and Tesla’s stock returns.

df_tweet_sentiment <- df_tweets_elon_musk_tokens_sentiment_nrc %>%
    mutate(Date = as.Date(created_at)) %>%
    group_by(Date) %>%
    count(sentiment) %>%
    ungroup()

df_tweet_sentiment_wide <- df_tweet_sentiment %>%
    pivot_wider(id_cols = Date,
                names_from = "sentiment",
                values_from = "n")

Radar chart of sentiment analysis

df_tweets_elon_musk_tokens_sentiment_nrc_count_wide <- df_tweets_elon_musk_tokens_sentiment_nrc_count %>%
    select(-colour) %>%
    pivot_wider(names_from = "sentiment",
                values_from = "n") %>%
    add_row(positive = 2500, trust = 2500, anticipation = 2500, negative = 2500, joy = 2500,
            fear = 2500, sadness = 2500, anger = 2500, surprise = 2500, disgust = 2500) %>%
    add_row(positive = 0, trust = 0, anticipation = 0, negative = 0, joy = 0,
            fear = 0, sadness = 0, anger = 0, surprise = 0, disgust = 0)

df_tweets_elon_musk_tokens_sentiment_nrc_count_wide[c(2, 3, 1), ]  %>%
    radarchart(maxmin = T,
               axistype = 2,
               pcol   = col_palette_blue[6],
               pfcol  = rgb(0.2, 0.5, 0.6, 0.3),
               plwd   = 4,
               plty   = 1,
               cglcol = "grey",
               cglty  = 1,
               axislabcol = "gray50",
               caxislabels = seq(0, 20, 5),
               cglwd = 0.8,
               vlcex = 0.8,
               title  = "Sentiment Analysis - Elon Musk's Tweets")

Let’s check how accurately we matched words in the tweets to specific sentiments.

#

12 Machine Learning Model

Preparing Elon Musk’s tweet data for ML model

# df_tweet_data <- df_tweets_elon_musk_tokens_sentiment_nrc %>%
#     mutate(hashtags = unlist(hashtags),
#            Date     = as.Date(created_at)) %>%
#     arrange(Date) %>%
#     select(Date, source, is_quote, is_retweet,
#            favorite_count, retweet_count, words, sentiment)

Join tweets data with Tesla stock return data

# df_Tesla_stock_returns_data <- df_Tesla_stock_data %>%
#     transmute(Date,
#               Return = log(Adjusted) - lag(log(Adjusted)),
#               Volume)
#
# df_Tesla_stock_returns_tweets <- df_Tesla_stock_returns_data %>%
#     full_join(df_tweet_data,
#               by = "Date")

Filter for date at which we actually have tweets data

# df_Tesla_stock_returns_tweets <- df_Tesla_stock_returns_tweets %>%
#     filter(!is.na(Return),
#            Date >= "2019-11-25")

OLS linear regression models

# model_lm <- df_Tesla_stock_returns_tweets %>%
#     lm(Return ~ ., data = .)

# model_lm %>%
#     coeftest(vcov. = vcovHAC(.)) %>%
#     tidy() %>%
#     mutate(significance = case_when(p.value >= 0.1                    ~ "",
#                                     p.value < 0.1 & p.value >= 0.05   ~ ".",
#                                     p.value < 0.05 & p.value >= 0.01  ~ "*",
#                                     p.value < 0.01 & p.value >= 0.001 ~ "**",
#                                     p.value < 0.001                   ~ "***")) %>%
#     mutate_if(is.numeric, ~ round(., digits = 3)) %>%
#     datatable()

Set train control for caret ML model

# ctrl <- trainControl(method          = "repeatedcv",
#                      number          = 10,  # Number of folds
#                      repeats         = 5,   # Number of repeats (complete sets of folds)
#                      classProbs      = F,
#                      summaryFunction = defaultSummary,  # mnLogLoss,
#                      search         = "random",  # random hyperparameter search grid  # "grid"
#                      preProcOptions  = list(thresh = 0.95, ICAcomp = 3, k = 5, freqCut = 95/5,
#                                             uniqueCut = 10, cutoff = 0.9),
#                      verboseIter     = T,
#                      allowParallel   = T)

Random forest ML model

# system.time(
#     model_random_forest <- train(Return ~ .,
#                                  data           = df_Tesla_stock_returns_tweets,
#                                  method         = "rf",
#                                  na.action      = na.pass,
#                                  preProcess     = c("nzv", "medianImpute", "center", "scale"),  # knnImpute
#                                  trControl      = ctrl,
#                                  metric         = "RMSE")
#     )
#
# model_random_forest %>%
#     ggplot()

Plot variable importance of random forest model

# model_random_forest %>%
#     varImp() %>%
#     plot()

13 Candlestick Chart - Tesla’s Stock Price

Candlestick chart

df_Tesla_stock_data %>%
    plot_ly(x     = ~ Date,
            type  = "candlestick",
            open  = ~ Open,
            close = ~ Close,
            high  = ~ High,
            low   = ~ Low) %>%
    layout(title = "Candlestick Chart - Tesla Stock Price")

df_Tesla_stock_data %>%
    filter(Date >= "2020-01-01") %>%
    plot_ly(x     = ~ Date,
            type  = "candlestick",
            open  = ~ Open,
            close = ~ Close,
            high  = ~ High,
            low   = ~ Low) %>%
    layout(title = "Candlestick Chart - Tesla Stock Price")

OHLC

# TODO: Add nicer colors

# p_LC <- df_Tesla_stock_data %>%
#     ggplot(aes(x = Date, y = Adjusted)) +
#     geom_line(size = 1) +
#     geom_line(aes(y = Low),
#               col      = palette()[1],
#               linetype = "dashed") +
#     geom_line(aes(y = High),
#               col      = palette()[8],
#               linetype = "dashed") +
#     geom_ribbon(aes(ymin  = Low,
#                     ymax  = High),
#                 alpha = 0.4) +
#     labs(title = "Tesla Trading Volume - ...While Trading Volume Remained Constant Over Time") +
#     scale_x_date(date_breaks = "1 year",
#                  date_labels = "%Y") +
#     scale_y_continuous(labels = scales::dollar)
#
# p_LC %>%
#     ggplotly()

14 Animated Plots

Simple animated scatter plot over time with plotly and the gapminder data

gapminder %>%
    filter(country %in% c("China", "United States", "United Kingdom", "India",
                          "Germany", "Switzerland", "Austria", "Japan", "Singapore")) %>%
    plot_ly(x      = ~ lifeExp,
            y      = ~ gdpPercap,
            size   = ~ pop,
            color  = ~ country,
            frame  = ~ year,
            type   = "scatter",
            mode   = "markers",
            colors = palette())

  1. Source: SEC Statement and Press Release on Tesla